ABanditNAS: Anti-Bandit for Neural Architecture Search
Finally, after K ∗ T samples, where T is a hyperparameter, we calculate the confidence
with the UCB according to Eq. 4.8 as
\[
s_U\big(o_k^{(i,j)}\big) = m_{k,t}^{(i,j)} + \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}.
\tag{4.12}
\]
The operation with the minimal UCB on every edge is abandoned. This means that operations
that are given more opportunities but result in poor performance are removed. With this
pruning strategy, the search space is significantly reduced from $|\Omega^{(i,j)}|^{10\times 6}$ to
$(|\Omega^{(i,j)}| - 1)^{10\times 6}$, and the reduced space becomes
\[
\Omega^{(i,j)} \leftarrow \Omega^{(i,j)} - \Big\{\arg\min_{o_k^{(i,j)}} s_U\big(o_k^{(i,j)}\big)\Big\}, \quad \forall (i,j).
\tag{4.13}
\]
The reduction procedure is repeated until the optimal structure is obtained, where only one
operation is left on each edge.
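As an illustration, the following NumPy sketch computes the UCB scores of Eq. 4.12 for the candidate operations on a single edge and abandons the one with the minimal score, as in Eq. 4.13; the operation names and the statistics $m_{k,t}$, $n_{k,t}$ are made-up placeholders rather than values from the actual search:

# A minimal sketch of the pruning step in Eqs. (4.12)-(4.13); the per-edge
# statistics below are illustrative placeholders, not ABanditNAS code.
import numpy as np

def ucb_scores(m, n, N):
    """Upper confidence bounds s_U(o_k) = m_k + sqrt(2 * log(N) / n_k)."""
    return m + np.sqrt(2.0 * np.log(N) / n)

def prune_edge(ops, m, n, N):
    """Abandon the operation with the minimal UCB on one edge (Eq. 4.13)."""
    scores = ucb_scores(np.asarray(m, dtype=float), np.asarray(n, dtype=float), N)
    worst = int(np.argmin(scores))          # arg min_k s_U(o_k)
    return [o for i, o in enumerate(ops) if i != worst]

# Example: one edge with K = 4 candidate operations after K * T samples.
ops = ["skip", "sep_conv_3x3", "max_pool_3x3", "dil_conv_5x5"]
m   = [0.52, 0.61, 0.44, 0.58]              # m_{k,t}: running mean performance
n   = [12, 15, 10, 13]                      # n_{k,t}: times each op was sampled
N   = sum(n)                                # total number of samples on this edge
print(prune_edge(ops, m, n, N))             # the op with minimal UCB is removed

Applying prune_edge to every edge (i, j) implements one reduction round of Eq. 4.13.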
Complexity Analysis. There are $O(K^{|E_M| \times v})$ combinations in the search space
discovery process with $v$ types of different cells. In contrast, ABanditNAS reduces the
search space every $K*T$ epochs. Therefore, the complexity of the proposed method is
\[
O\Big(T \times \sum_{k=2}^{K} k\Big) = O(TK^2).
\tag{4.14}
\]
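The quadratic bound can be read off from the arithmetic series over the shrinking number of candidate operations:
\[
T \sum_{k=2}^{K} k \;=\; T\left(\frac{K(K+1)}{2} - 1\right) \;=\; O(TK^{2}).
\]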
4.2.4 Adversarial Optimization
The goal of adversarial training [167] is to learn networks that are robust to adversarial
attacks. Given a network $f_\theta$ parameterized by $\theta$, a dataset $(x_e, y_e)$, a loss function $l$ and
a threat model $\Delta$, the learning problem can be formulated as the optimization problem
\[
\min_\theta \sum_e \max_{\delta \in \Delta} l\big(f_\theta(x_e + \delta), y_e\big),
\]
where $\delta$ is the adversarial perturbation. In this
chapter, we consider the typical $l_\infty$ threat model [167], $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$ for some $\epsilon > 0$.
Here, $\|\cdot\|_\infty$ is the $l_\infty$-norm distance metric and $\epsilon$ is the adversarial manipulation budget.
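To make the inner maximization concrete, the following PyTorch sketch approximates $\max_{\delta \in \Delta} l\big(f_\theta(x+\delta), y\big)$ with a few projected signed-gradient (PGD) steps; the model, the cross-entropy loss, and the values of $\epsilon$ and $\alpha$ are illustrative placeholders rather than the settings used in this chapter:

# A hedged sketch of the inner maximization over Delta = {delta : ||delta||_inf <= eps}.
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Approximate argmax_{||delta||_inf <= eps} l(f_theta(x + delta), y) with PGD."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascent step on the loss, then projection back onto the l_inf ball Delta.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

Taking a single step of size $\alpha = \epsilon$ from $\delta = 0$ reduces this to the FGSM approximation discussed next.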
The adversarial training procedure uses attacks to approximate the inner maximization over
$\Delta$, followed by some variation of gradient descent on the model parameters $\theta$. For example,
one of the earliest versions of adversarial training uses the Fast Gradient Sign Method
(FGSM) [75] to approximate the inner maximization. This can be seen as a relatively
inaccurate approximation of the inner maximization for $l_\infty$ perturbations, and it has the closed-form
solution $\delta = \epsilon \cdot \mathrm{sign}\big(\nabla_x l(f(x), y)\big)$. A better approximation of the inner maximization
is to take multiple smaller FGSM steps of size $\alpha$ instead. However, the number of gradient
computations caused by the multiple steps is proportional to $O(EF)$ in a single epoch, where
$E$ is the size of the dataset and $F$ is the number of steps taken by the PGD adversary.
This is $F$ times higher than standard training with $O(E)$ gradient computations per epoch,
so adversarial training is typically $F$ times slower. To accelerate adversarial training, we
combine FGSM with random initialization [247] in ABanditNAS. The resulting ABanditNAS
with adversarial training is summarized in Algorithm 8.
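As a rough illustration of this fast variant, the sketch below performs one training step of FGSM with random initialization; the optimizer, loss, and the values of $\epsilon$ and $\alpha$ are placeholder assumptions, not the exact ABanditNAS configuration:

# A minimal sketch of one fast adversarial-training step (FGSM + random start);
# model, optimizer, eps and alpha are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_rs_step(model, optimizer, x, y, eps=8/255, alpha=10/255):
    # Random initialization inside the l_inf ball, then one signed-gradient step.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    # Outer minimization: a regular descent step on theta using the adversarial loss.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x + delta), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()

In practice the perturbed input $x + \delta$ is also clipped to the valid pixel range, which is omitted here for brevity.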
4.2.5 Analysis
Effect of the hyperparameter λ. The hyperparameter λ balances the contribution of past
and current performance. Different values of λ result in similar search costs. The
performance of the structures searched by ABanditNAS with different values of λ is used